Food Delivery Time Prediction - Analysis¶
Given the number of methods and models, the code may take a few minutes to execute.
If you want to view the results without running it, go to the file Food_delivery_analisis.html.
Setup¶
As a first step, we import the necessary libraries and tools.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as stats
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from scipy.stats import chi2_contingency
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
from functools import partial
from sklearn.model_selection import (cross_validate, KFold, ShuffleSplit, train_test_split, cross_val_score, GridSearchCV)
from sklearn.base import clone
from ISLP.models import sklearn_sm
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import confusion_matrix, mean_absolute_error, r2_score
The dataset Food_Delivery_Times is imported from the Git directory.
After examining the contents of the dataset, we observe the presence of null values and decide to remove the affected rows.
Null values are removed because they can interfere with model fitting, leading to inaccurate results.
Food = pd.read_csv("dataset/Food_Delivery_Times.csv")
print("Number of rows in the original dataset:", len(Food))
null_rows = Food.isnull().any(axis=1).sum()
print("Number of rows with missing values:", null_rows)
Food = Food.dropna()
print("Number of rows after removing missing values:", len(Food))
Food.head()
Number of rows in the original dataset: 1000
Number of rows with missing values: 117
Number of rows after removing missing values: 883
| | Order_ID | Distance_km | Weather | Traffic_Level | Time_of_Day | Vehicle_Type | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 522 | 7.93 | Windy | Low | Afternoon | Scooter | 12 | 1.0 | 43 |
| 1 | 738 | 16.42 | Clear | Medium | Evening | Bike | 20 | 2.0 | 84 |
| 2 | 741 | 9.52 | Foggy | Low | Night | Scooter | 28 | 1.0 | 59 |
| 3 | 661 | 7.44 | Rainy | Medium | Afternoon | Scooter | 5 | 1.0 | 37 |
| 4 | 412 | 19.03 | Clear | Low | Morning | Bike | 16 | 5.0 | 68 |
Outliers are extreme values that can distort analysis and lead to misleading results.
We used the Interquartile Range (IQR) method to detect and remove them.
This ensures cleaner data and more reliable outcomes in the following analyses.
numeric_columns = Food.select_dtypes(include=['float64', 'int64']).columns
Food_original = Food.copy()
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
for col in numeric_columns:
    Food = remove_outliers(Food, col)

plt.figure(figsize=(18, 12))
for i, col in enumerate(numeric_columns, 1):
    plt.subplot(2, len(numeric_columns), i)
    sns.boxplot(x=Food_original[col])
    plt.title(f"{col} - Before")
    plt.subplot(2, len(numeric_columns), len(numeric_columns) + i)
    sns.boxplot(x=Food[col])
    plt.title(f"{col} - After")
plt.tight_layout()
plt.show()
print("Number of rows after removing outliers:", len(Food))
Number of rows after removing outliers: 879
The boxplots show the distribution of numerical variables before and after outlier removal.
The dataset shows good quality as it contains few outliers.
Extreme points, especially in Delivery_Time_min, were removed, making the data more compact.
The cleaned data is now more consistent and less influenced by rare or abnormal values.
Simple Linear Regression¶
Before building regression models, we assess the quality of the dataset by analyzing the correlation between variables through the construction of a correlation matrix.
numeric_df = Food.select_dtypes(include=[np.number])
if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(10, 8))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()
In our case, some variables show a clear linear relationship with delivery time, indicating their potential importance in the regression model.
Additionally, the predictors are not highly correlated with each other, suggesting that the dataset is suitable for linear regression and that multicollinearity is not a concern.
The three numerical variables (Distance_km, Preparation_Time_min, Courier_Experience_yrs) are selected, and the relationship between each of them and the response variable Delivery_Time_min is analyzed individually.
The variable Order_ID is completely irrelevant to our objective, so we do not take it into account.
X = pd.DataFrame({'intercept': np.ones(Food.shape[0]), 'Distance_km': Food['Distance_km']})
X[:4]
| intercept | Distance_km | |
|---|---|---|
| 0 | 1.0 | 7.93 |
| 1 | 1.0 | 16.42 |
| 2 | 1.0 | 9.52 |
| 3 | 1.0 | 7.44 |
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| coef | std err | t | P>|t| | |
|---|---|---|---|---|
| intercept | 26.8673 | 0.890 | 30.184 | 0.0 |
| Distance_km | 2.9180 | 0.077 | 37.746 | 0.0 |
Distance_km has a p-value of 0.0, indicating that it is highly significant.
The coefficient is positive, meaning that as the distance increases, the waiting time also increases.
The intercept is high, suggesting that distance alone does not sufficiently explain the delivery times.
Model prediction and confidence intervals¶
We define a linear regression design matrix including an intercept and the explanatory variable Distance_km using the ModelSpec class from ISLP.
This allows for consistent and automatic construction of the design matrix for both the original dataset and new observations.
We then generate predictions and 95% confidence intervals for new values of Distance_km based on the previously fitted model.
model = MS(['Distance_km'])
model = model.fit(Food)
X = model.transform(Food)
X[:4]
| intercept | Distance_km | |
|---|---|---|
| 0 | 1.0 | 7.93 |
| 1 | 1.0 | 16.42 |
| 2 | 1.0 | 9.52 |
| 3 | 1.0 | 7.44 |
new_df = pd.DataFrame({'Distance_km':[10, 12, 14, 16, 20]})
newX = model.transform(new_df)
newX
| intercept | Distance_km | |
|---|---|---|
| 0 | 1.0 | 10 |
| 1 | 1.0 | 12 |
| 2 | 1.0 | 14 |
| 3 | 1.0 | 16 |
| 4 | 1.0 | 20 |
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
array([56.047364 , 61.88336722, 67.71937045, 73.55537368, 85.22738013])
new_predictions.conf_int(alpha=0.05)
array([[55.18622211, 56.90850588],
[60.97123424, 62.79550021],
[66.66742956, 68.77131134],
[72.30423639, 74.80651096],
[83.48514315, 86.96961711]])
def abline(ax, b, m, *args, **kwargs):
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)
ax = Food.plot.scatter('Distance_km', 'Delivery_Time_min')
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=3)
The scatter plot shows a positive correlation between distance in kilometers and delivery time in minutes.
The data points align well with the linear trend line, indicating that the model is accurate and useful for predicting delivery times based on distance.
Fitted values vs residuals¶
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(results.fittedvalues, results.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');
The residuals are spread around the zero line, indicating that the model fits the data well.
There are no clear patterns in the residuals, suggesting the model's assumptions about linearity are valid.
Leverage¶
infl = results.get_influence()
ax = plt.subplots(figsize=(6, 6))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag)
np.int64(354)
The leverage values are low, meaning no single data point has an outsized influence on the model.
The rule-of-thumb threshold for high leverage is about 0.0045; since almost all data points fall below it, the model is well-balanced.
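The 0.0045 figure matches the conventional rule of thumb that flags observations with leverage above 2(p+1)/n; a quick check with the sample size used here:

```python
# Rule-of-thumb cutoff for high leverage: 2 * (p + 1) / n,
# with p = 1 predictor (Distance_km) and n = 879 rows after outlier removal
p, n = 1, 879
threshold = 2 * (p + 1) / n
print(round(threshold, 4))
```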
X = pd.DataFrame({'intercept': np.ones(Food.shape[0]), 'Preparation_Time_min': Food['Preparation_Time_min']})
X[:4]
| intercept | Preparation_Time_min | |
|---|---|---|
| 0 | 1.0 | 12 |
| 1 | 1.0 | 20 |
| 2 | 1.0 | 28 |
| 3 | 1.0 | 5 |
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| coef | std err | t | P>|t| | |
|---|---|---|---|---|
| intercept | 41.2948 | 1.727 | 23.913 | 0.0 |
| Preparation_Time_min | 0.8703 | 0.093 | 9.322 | 0.0 |
Preparation_Time_min has a p-value of 0.0, indicating that it is highly significant.
The coefficient is positive, meaning that as the preparation time increases, the waiting time also increases.
The intercept is relatively high, suggesting that preparation time alone does not fully explain the total waiting time.
Model prediction and confidence intervals¶
model = MS(['Preparation_Time_min'])
model = model.fit(Food)
X = model.transform(Food)
X[:4]
| intercept | Preparation_Time_min | |
|---|---|---|
| 0 | 1.0 | 12 |
| 1 | 1.0 | 20 |
| 2 | 1.0 | 28 |
| 3 | 1.0 | 5 |
new_df = pd.DataFrame({'Preparation_Time_min':[10, 12, 14, 16, 20]})
newX = model.transform(new_df)
newX
| intercept | Preparation_Time_min | |
|---|---|---|
| 0 | 1.0 | 10 |
| 1 | 1.0 | 12 |
| 2 | 1.0 | 14 |
| 3 | 1.0 | 16 |
| 4 | 1.0 | 20 |
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
array([49.99800226, 51.73863378, 53.47926529, 55.21989681, 58.70115983])
new_predictions.conf_int(alpha=0.05)
array([[48.14830241, 51.84770211],
[50.12183792, 53.35542964],
[52.03868239, 54.91984819],
[53.87634173, 56.56345188],
[57.26216877, 60.1401509 ]])
def abline(ax, b, m, *args, **kwargs):
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)
ax = Food.plot.scatter('Preparation_Time_min', 'Delivery_Time_min')
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=3)
The scatter plot shows a positive correlation between preparation time and delivery time in minutes.
The data points generally align with the linear trend line, suggesting that the model is reasonably good.
However, compared to the previous model, the dispersion of the data points appears to increase.
Additionally, there is noticeable vertical overlap between points at similar preparation times, indicating that factors beyond preparation time also influence delivery times.
Fitted values vs residuals¶
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(results.fittedvalues, results.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');
The residuals appear to be fairly symmetrically distributed above and below the zero line, which supports the assumption of linearity in the model.
However, the residuals are widely spread, showing a noticeable vertical dispersion.
Leverage¶
infl = results.get_influence()
ax = plt.subplots(figsize=(6, 6))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag);
Leverage values across the dataset are low, indicating that no individual observation exerts excessive influence on the model's predictions.
The rule-of-thumb threshold for high leverage is about 0.0045, and all points fall below this value.
From this point of view, the model behaves very well.
X = pd.DataFrame({'intercept': np.ones(Food.shape[0]), 'Courier_Experience_yrs': Food['Courier_Experience_yrs']})
X[:4]
| intercept | Courier_Experience_yrs | |
|---|---|---|
| 0 | 1.0 | 1.0 |
| 1 | 1.0 | 2.0 |
| 2 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 |
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| coef | std err | t | P>|t| | |
|---|---|---|---|---|
| intercept | 58.7422 | 1.328 | 44.222 | 0.000 |
| Courier_Experience_yrs | -0.5695 | 0.242 | -2.352 | 0.019 |
Courier_Experience_yrs has a p-value of 0.019, indicating that it is statistically significant.
The coefficient is negative, meaning that as courier experience increases, the waiting time tends to decrease.
The intercept is relatively high, suggesting that courier experience alone does not fully account for the variability in waiting times.
Compared to the previous models, Courier_Experience_yrs has a smaller effect size and slightly higher uncertainty, as reflected in the standard error and t-statistic.
Model prediction and confidence intervals¶
model = MS(['Courier_Experience_yrs'])
model = model.fit(Food)
X = model.transform(Food)
X[:4]
| intercept | Courier_Experience_yrs | |
|---|---|---|
| 0 | 1.0 | 1.0 |
| 1 | 1.0 | 2.0 |
| 2 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 |
new_df = pd.DataFrame({'Courier_Experience_yrs':[1, 3, 5, 7, 10]})
newX = model.transform(new_df)
newX
| intercept | Courier_Experience_yrs | |
|---|---|---|
| 0 | 1.0 | 1 |
| 1 | 1.0 | 3 |
| 2 | 1.0 | 5 |
| 3 | 1.0 | 7 |
| 4 | 1.0 | 10 |
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
array([58.1727115 , 57.03372218, 55.89473287, 54.75574356, 53.04725959])
new_predictions.conf_int(alpha=0.05)
array([[55.95314347, 60.39227952],
[55.43959247, 58.62785189],
[54.49350217, 57.29596357],
[52.96882553, 56.54266159],
[50.14442453, 55.95009465]])
def abline(ax, b, m, *args, **kwargs):
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)
ax = Food.plot.scatter('Courier_Experience_yrs', 'Delivery_Time_min')
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=3)
The plot shows a slightly downward trend line, indicating a weak negative relationship between courier experience and delivery time.
The data points are highly dispersed around the trend line, suggesting that the model does not adequately explain the variability in the data.
Additionally, there is considerable vertical overlap between the points, which indicates that factors beyond experience may significantly influence delivery times.
This suggests the possibility of a nonlinear relationship, and the model is likely the worst one so far.
Fitted values vs residuals¶
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(results.fittedvalues, results.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');
In this case as well, the residuals are symmetric but very large, indicating poor accuracy.
Leverage¶
infl = results.get_influence()
ax = plt.subplots(figsize=(6, 6))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag);
No point has high leverage, meaning no single point dominates the model's estimation.
This is positive, but it also confirms that no single observation provides strong information to the model, which is consistent with the fact that the model does not adequately explain the variability in the data.
X = MS(['Distance_km', 'Preparation_Time_min', 'Courier_Experience_yrs']).fit_transform(Food)
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| coef | std err | t | P>|t| | |
|---|---|---|---|---|
| intercept | 13.0071 | 1.320 | 9.854 | 0.0 |
| Distance_km | 2.9493 | 0.066 | 45.000 | 0.0 |
| Preparation_Time_min | 0.9273 | 0.051 | 18.096 | 0.0 |
| Courier_Experience_yrs | -0.4803 | 0.127 | -3.777 | 0.0 |
The results of the linear regression model indicate that distance, preparation time, and courier experience all significantly influence food delivery time (p-values reported as 0.000).
Specifically, delivery time increases by approximately 2.95 minutes per additional kilometer.
It also increases by about 0.93 minutes for each additional minute of preparation.
In contrast, each additional year of courier experience reduces delivery time by about 0.48 minutes.
print("R2", results.rsquared)
print("RSE", np.sqrt(results.scale))
R2 0.7269698515442016
RSE 11.024219845632128
The linear regression model explains approximately 73% of the variance in food delivery time, indicating a strong fit.
The residual standard error (RSE) is 11 minutes, suggesting a moderate prediction error given the likely range of delivery times.
Overall, the model performs well.
Qualitative predictors¶
Now we examine the remaining variables, which are categorical.
for col in Food.columns:
    if Food[col].dtype == 'object':
        print(f"{col}: {Food[col].unique()}")
Weather: ['Windy' 'Clear' 'Foggy' 'Rainy' 'Snowy']
Traffic_Level: ['Low' 'Medium' 'High']
Time_of_Day: ['Afternoon' 'Evening' 'Night' 'Morning']
Vehicle_Type: ['Scooter' 'Bike' 'Car']
Before creating the new model, we verify that the categorical variables are not strongly correlated with each other, ensuring that each one contributes useful information to the model.
Since the variables are not numerical, we cannot use a standard correlation matrix.
Instead, we use Cramér's V, a statistical measure of association between two categorical variables, which ranges from 0 (no association) to 1 (perfect association).
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
cramer_matrix = pd.DataFrame(
    np.zeros((len(categorical), len(categorical))),
    index=categorical,
    columns=categorical
)
for var1 in categorical:
    for var2 in categorical:
        cramer_matrix.loc[var1, var2] = cramers_v(Food[var1], Food[var2])
plt.figure(figsize=(10, 8))
sns.heatmap(cramer_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title("Cramér's V of Categorical Features")
plt.show()
The Cramér’s V matrix shows very low values, indicating that the categorical variables are essentially independent from each other.
No significant associations emerge, suggesting that each variable provides distinct and non-redundant information to the model.
Using the tools from ISLP, we build a regression model that includes all variables, both numerical and categorical.
Categorical variables are automatically handled by the library through one-hot encoding, a process that transforms each category into a separate binary variable.
This allows the model to interpret categorical information in a numerical form suitable for linear regression.
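The encoding ModelSpec performs can be approximated with plain pandas; a minimal sketch on a toy DataFrame (illustrative, not the actual Food data), using `pd.get_dummies` with `drop_first=True` so the first category of each variable becomes the baseline absorbed by the intercept:

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
df = pd.DataFrame({
    'Weather': ['Clear', 'Rainy', 'Snowy', 'Clear'],
    'Distance_km': [7.9, 16.4, 9.5, 7.4],
})

# drop_first=True drops the first level ('Clear'), which becomes the implicit
# baseline -- matching the missing Weather[Clear] row in the OLS tables
encoded = pd.get_dummies(df, columns=['Weather'], drop_first=True, dtype=float)
print(list(encoded.columns))
# ['Distance_km', 'Weather_Rainy', 'Weather_Snowy']
```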
model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Time_of_Day',
    'Vehicle_Type',
    'Preparation_Time_min',
    'Courier_Experience_yrs'
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| coef | std err | t | P>|t| | |
|---|---|---|---|---|
| intercept | 18.4314 | 1.467 | 12.564 | 0.000 |
| Distance_km | 2.9055 | 0.056 | 51.622 | 0.000 |
| Weather[Foggy] | 8.0616 | 1.059 | 7.611 | 0.000 |
| Weather[Rainy] | 4.9402 | 0.832 | 5.938 | 0.000 |
| Weather[Snowy] | 10.2738 | 1.126 | 9.128 | 0.000 |
| Weather[Windy] | 1.9968 | 1.129 | 1.769 | 0.077 |
| Traffic_Level[Low] | -12.2613 | 0.869 | -14.102 | 0.000 |
| Traffic_Level[Medium] | -6.0433 | 0.865 | -6.984 | 0.000 |
| Time_of_Day[Evening] | 0.0829 | 0.826 | 0.100 | 0.920 |
| Time_of_Day[Morning] | -0.8775 | 0.817 | -1.074 | 0.283 |
| Time_of_Day[Night] | -1.4297 | 1.238 | -1.155 | 0.248 |
| Vehicle_Type[Car] | -0.3326 | 0.854 | -0.389 | 0.697 |
| Vehicle_Type[Scooter] | -1.3290 | 0.739 | -1.799 | 0.072 |
| Preparation_Time_min | 0.9447 | 0.044 | 21.507 | 0.000 |
| Courier_Experience_yrs | -0.5556 | 0.109 | -5.081 | 0.000 |
From the regression results, several observations can be made:
- Distance_km has a strong and significant positive effect on delivery time.
- Adverse weather conditions like Foggy, Rainy, and Snowy significantly increase delivery time, compared to Clear weather.
- Traffic_Level[Low] and Medium significantly reduce delivery time compared to High traffic, which is the baseline.
- Different times of the day (Morning, Evening, Night) do not show statistically significant differences (p-values > 0.05) from the baseline (Afternoon).
- Similarly, neither Car nor Scooter types show significant differences (p-values > 0.05) from the baseline vehicle (Bike), suggesting that vehicle type may not strongly influence delivery time in this model.
- Preparation_Time_min significantly increases delivery time, as expected.
- Interestingly, Courier_Experience_yrs is negatively associated with delivery time, indicating that more experienced couriers tend to deliver faster.
print("R2", results.rsquared)
print("RSE", np.sqrt(results.scale))
R2 0.803281642221906
RSE 9.416991168829025
The linear regression model now explains approximately 80% of the variance in food delivery time, indicating an even stronger fit compared to the previous model (73%).
The residual standard error is now about 9.4 minutes, a slight improvement in prediction accuracy compared to the previous model (11 minutes).
Overall, the model has improved in both explanatory power and prediction accuracy.
Observing the results, we notice that the p-values for Vehicle_Type and Time_of_Day are high.
Before removing them, we aim to better understand why vehicle type does not affect delivery time.
In real scenarios, using a car should reduce delivery time compared to using a bike.
print(Food['Vehicle_Type'].value_counts())
Vehicle_Type
Bike       450
Scooter    259
Car        170
Name: count, dtype: int64
print(Food.groupby('Vehicle_Type', observed=False)['Delivery_Time_min'].mean())
Vehicle_Type
Bike       56.473333
Car        56.841176
Scooter    54.965251
Name: Delivery_Time_min, dtype: float64
print(Food.groupby('Vehicle_Type', observed=False)['Distance_km'].mean())
Vehicle_Type
Bike       10.001067
Car        10.195471
Scooter     9.931197
Name: Distance_km, dtype: float64
We obtained the following results:
- As previously mentioned, the Cramér's V matrix shows that the association between Vehicle_Type and the other categorical variables is nearly nonexistent (maximum value 0.05).
- The distribution of Vehicle_Type is unbalanced (450 deliveries by bike, 259 by scooter, and 170 by car); despite this, each group is large enough to estimate means robustly.
- The analysis of average delivery times per vehicle type revealed similar values: 56.47 minutes for bike, 54.96 for scooter, and 56.84 for car.
- Similarly, the average distance traveled by each vehicle type was nearly the same (around 10 km) and this suggests that vehicle type is not associated with relevant differences in either distance or average time.
These results justify the high p-value and the absence of effect may indicate that logistic differences between vehicles are offset by other operational factors not included in the dataset (e.g., city routes, assigned paths, traffic, etc.).
We now return to the model and remove, as previously anticipated, Vehicle_Type and Time_of_Day.
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')
model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| coef | std err | t | P>|t| | |
|---|---|---|---|---|
| intercept | 17.5478 | 1.348 | 13.019 | 0.000 |
| Distance_km | 2.9057 | 0.056 | 51.668 | 0.000 |
| Weather[Foggy] | 8.0511 | 1.059 | 7.602 | 0.000 |
| Weather[Rainy] | 4.9947 | 0.831 | 6.011 | 0.000 |
| Weather[Snowy] | 10.4361 | 1.120 | 9.318 | 0.000 |
| Weather[Windy] | 2.0469 | 1.126 | 1.818 | 0.069 |
| Traffic_Level[Low] | -12.2128 | 0.869 | -14.056 | 0.000 |
| Traffic_Level[Medium] | -6.0979 | 0.865 | -7.049 | 0.000 |
| Preparation_Time_min | 0.9446 | 0.044 | 21.509 | 0.000 |
| Courier_Experience_yrs | -0.5510 | 0.109 | -5.040 | 0.000 |
print("R2", results.rsquared)
print("RSE", np.sqrt(results.scale))
R2 0.8019562697725547
RSE 9.421439203442182
By removing the two variables, we obtained a simpler and more interpretable model.
This was achieved without losing the good performance of the previous full model: R² and RSE values remain nearly identical to before.
y_pred = results.fittedvalues
plt.figure(figsize=(6, 4))
plt.scatter(y, y_pred, alpha=0.6, color='royalblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--')
plt.xlabel('Observed Values')
plt.ylabel('Predicted Values')
plt.title('Observed Values vs Predicted Values')
plt.grid(True)
plt.show()
residuals = results.resid
plt.figure(figsize=(6, 4))
plt.scatter(y_pred, residuals, alpha=0.6, color='darkorange')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.grid(True)
plt.show()
sns.histplot(residuals, kde=True, stat="density", color="skyblue", bins=30)
plt.title("Residual Distribution")
plt.xlabel("Residual")
plt.ylabel("Density")
plt.show()
Observing the graphs, we noticed the presence of outliers.
To improve the robustness of the regression model, we removed outliers based on the distribution of residuals from an initial OLS fit.
Residuals are assumed to be approximately normally distributed (as the histogram above suggests), and a threshold based on their standard deviation is applied to flag anomalous observations.
Instead of the conventional ±3 standard deviations, we adopted a slightly more restrictive cutoff of ±2.9 standard deviations.
This choice allowed us to remove all significant outliers without discarding a substantial portion of the data.
residuals = results.resid
residuals_std = np.std(residuals)
threshold = 2.9 * residuals_std
mask = np.abs(residuals) < threshold
Food = Food.loc[mask].copy()
print(f"Number of rows after removing outliers: {len(Food)}")
Number of rows after removing outliers: 860
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')
model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| coef | std err | t | P>|t| | |
|---|---|---|---|---|
| intercept | 14.5238 | 0.966 | 15.037 | 0.0 |
| Distance_km | 2.9635 | 0.040 | 74.029 | 0.0 |
| Weather[Foggy] | 6.3084 | 0.767 | 8.225 | 0.0 |
| Weather[Rainy] | 4.7894 | 0.592 | 8.095 | 0.0 |
| Weather[Snowy] | 9.9081 | 0.799 | 12.403 | 0.0 |
| Weather[Windy] | 2.9583 | 0.795 | 3.721 | 0.0 |
| Traffic_Level[Low] | -11.2013 | 0.621 | -18.025 | 0.0 |
| Traffic_Level[Medium] | -5.4008 | 0.619 | -8.725 | 0.0 |
| Preparation_Time_min | 0.9985 | 0.031 | 31.932 | 0.0 |
| Courier_Experience_yrs | -0.5361 | 0.078 | -6.883 | 0.0 |
print("R2", results.rsquared)
print("RSE", np.sqrt(results.scale))
R2 0.894431839175429
RSE 6.643357633887748
The R² value has improved by roughly nine percentage points, nearly reaching 90%.
The error in delivery time is now approximately 6-7 minutes.
Additionally, only a few rows were removed, with the dataset decreasing from 879 to 860.
plt.figure(figsize=(6, 4))
plt.scatter(y, results.fittedvalues, alpha=0.5, color='royalblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.xlabel('Observed Values')
plt.ylabel('Predicted Values')
plt.title('Observed vs Predicted Values (after cleaning)')
plt.show()
plt.figure(figsize=(6, 4))
plt.scatter(results.fittedvalues, results.resid, alpha=0.5, color='darkorange')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values (after cleaning)')
plt.show()
However, the Residuals vs Predicted Values plot shows a clear cone-shaped pattern, with the residual spread widening to roughly ±20 as the predicted values increase.
This pattern indicates heteroscedasticity, meaning the variance of the residuals increases as the predicted values grow.
In other words, the model's error is not constant across all levels of prediction, but rather increases as the predicted values move further from 0.
A solution to this phenomenon is to apply logarithmic or square root transformations to the model.
These transformations stabilize the variance by compressing the scale of larger predicted values, making the residuals more homoscedastic and improving the model's assumptions.
def transform_y(y, method):
    if method == 'log':
        return np.log1p(y)
    elif method == 'sqrt':
        return np.sqrt(y)
    else:
        raise ValueError(f"Unknown transformation: {method}")
transformations = ['log', 'sqrt']
results_summary = {}
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')
model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
for transform in transformations:
    print(f"\nTransformation: {transform}")
    y_transformed = transform_y(y, transform)
    mod = sm.OLS(y_transformed, X)
    res = mod.fit()
    results_summary[transform] = res.rsquared
    y_pred = res.predict(X)
    if transform == 'log':
        y_pred_plot = np.expm1(y_pred)
    elif transform == 'sqrt':
        y_pred_plot = np.square(y_pred)
    else:
        y_pred_plot = y_pred
    plt.figure(figsize=(6, 4))
    plt.scatter(y, y_pred_plot, alpha=0.5, color='royalblue')
    plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--')
    plt.xlabel('Observed Values')
    plt.ylabel('Predicted Values')
    plt.title(f'Observed vs Predicted - {transform}')
    plt.grid(True)
    plt.show()
    residuals = y_transformed - y_pred
    plt.figure(figsize=(6, 4))
    plt.scatter(y_pred, residuals, color='darkorange', alpha=0.5)
    plt.axhline(y=0, color='k', linestyle='--')
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    plt.title(f'Residuals vs Predicted - {transform}')
    plt.grid(True)
    plt.show()
print("\nR² for each transformation:")
for key, value in results_summary.items():
    print(f"{key}: {value:.4f}")
Transformation: log
Transformation: sqrt
R² for each transformation:
log: 0.8815
sqrt: 0.9027
Observing the results from the plots, no clear improvement is noticeable.
Moreover, the R² values are similar to the original model.
Therefore, we prefer to continue our work with the non-transformed model.
Cross-Validation¶
Cross-validation helps evaluate how well the model generalizes to unseen data.
By testing the model on a separate validation set, we can detect overfitting and assess real-world performance.
In this case, we split the dataset into a training set Food_train and a validation set Food_valid.
We explicitly define the size of the validation set (430 samples), which is exactly half of the total dataset, to ensure a consistent and controlled evaluation.
Food_train , Food_valid = train_test_split(Food, test_size=430, random_state=0)
categorical = ['Weather', 'Traffic_Level']
Food_train[categorical] = Food_train[categorical].astype('category')
cv_model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])
X_train = cv_model.fit_transform(Food_train)
y_train = Food_train['Delivery_Time_min']
model = sm.OLS(y_train, X_train)
results = model.fit()
X_valid = cv_model.transform(Food_valid)
y_valid = Food_valid['Delivery_Time_min']
valid_pred = results.predict(X_valid)
MSE = np.mean((y_valid - valid_pred)**2)
RMSE = np.sqrt(MSE)
print("MSE", MSE)
print("RMSE", RMSE)
MSE 48.8813195529826
RMSE 6.991517685952214
Leave-one-out cross-validation¶
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')
loocv_model = sklearn_sm(sm.OLS, MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
]))
X = Food.drop(columns=['Delivery_Time_min'])
y = Food['Delivery_Time_min']
cv_results = cross_validate(loocv_model, X, y, cv=Food.shape[0])
MSE = np.mean(cv_results['test_score'])
RMSE = np.sqrt(MSE)
print("MSE", MSE)
print("RMSE", RMSE)
MSE 44.70860072334475
RMSE 6.686449036921223
Leave-one-out cross-validation is the most precise of the methods considered, but also the most expensive in both time and computation.
This is because the model must be refit n times, with each iteration leaving out a single observation for validation.
Indeed, the MSE and RMSE improved slightly compared with the previous validation-set approach, but the run took roughly 30 seconds, versus an essentially instantaneous execution for the previous method.
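As a sketch (on hypothetical synthetic data, not our dataset), LOOCV is simply k-fold with k equal to the number of observations, which is why its cost grows with dataset size; scikit-learn exposes this directly through `LeaveOneOut`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical toy data: 50 rows, 2 predictors, known linear signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)

# One model fit per observation: 50 fits in total.
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=LeaveOneOut(), scoring='neg_mean_squared_error')
print("fits:", len(scores))
print("LOOCV RMSE:", np.sqrt(-scores.mean()))
```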
K-fold cross-validation¶
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')
kf_model = sklearn_sm(sm.OLS, MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Preparation_Time_min',
'Courier_Experience_yrs',
]))
X = Food.drop(columns=['Delivery_Time_min'])
y = Food['Delivery_Time_min']
kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_results = cross_validate(kf_model, X, y, cv=kf)
MSE = np.mean(cv_results['test_score'])
RMSE = np.sqrt(MSE)
print("MSE", MSE)
print("RMSE", RMSE)
MSE 45.05819307596635
RMSE 6.712539986917497
K-fold cross-validation strikes a balance between precision and computational cost.
It involves splitting the dataset into k subsets, and iteratively training and validating the model on different combinations of these subsets.
The k-fold results fall between those of the 50% validation-set approach (slightly worse) and leave-one-out (slightly better).
However, the execution time is much closer to that of the initial cross-validation, being far less computationally expensive than LOOCV.
Shrinkage Methods¶
We continue the analysis of the model using shrinkage methods.
These methods shrink the coefficient estimates toward zero, and the lasso can set some of them exactly to zero.
To do this, it makes sense to return to a model with all variables.
This means adding back the categorical variables Time_of_Day and Vehicle_Type.
It is also important to standardize the variables, because the shrinkage penalty depends on the scale of each predictor.
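The effect of scale on the penalty can be illustrated with a minimal sketch on hypothetical data: the same linear relationship, expressed in different units, is shrunk by different amounts unless the predictor is standardized first.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: y depends on x with slope 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3.0 * x.ravel() + rng.normal(scale=0.1, size=100)

slope_raw = Ridge(alpha=10).fit(x, y).coef_[0]
# Same data, but x recorded in units 1000x smaller (e.g. km -> m),
# converted back to the original scale for comparison:
slope_rescaled = Ridge(alpha=10).fit(x * 1000, y).coef_[0] * 1000

# The penalty bites the raw fit harder, so its slope is shrunk more.
print(slope_raw, slope_rescaled)
```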
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Time_of_Day',
'Vehicle_Type',
'Preparation_Time_min',
'Courier_Experience_yrs'
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
numerical_columns = ['Distance_km', 'Preparation_Time_min', 'Courier_Experience_yrs']
scaler = StandardScaler()
X[numerical_columns] = scaler.fit_transform(X[numerical_columns])
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 59.5537 | 0.701 | 84.955 | 0.000 |
| Distance_km | 16.8426 | 0.228 | 73.852 | 0.000 |
| Weather[Foggy] | 6.3464 | 0.768 | 8.264 | 0.000 |
| Weather[Rainy] | 4.7687 | 0.593 | 8.044 | 0.000 |
| Weather[Snowy] | 9.8532 | 0.804 | 12.262 | 0.000 |
| Weather[Windy] | 2.8989 | 0.798 | 3.633 | 0.000 |
| Traffic_Level[Low] | -11.2475 | 0.623 | -18.066 | 0.000 |
| Traffic_Level[Medium] | -5.3621 | 0.620 | -8.653 | 0.000 |
| Time_of_Day[Evening] | -0.3425 | 0.591 | -0.580 | 0.562 |
| Time_of_Day[Morning] | -0.5390 | 0.581 | -0.927 | 0.354 |
| Time_of_Day[Night] | -1.4026 | 0.885 | -1.585 | 0.113 |
| Vehicle_Type[Car] | -0.2829 | 0.610 | -0.464 | 0.643 |
| Vehicle_Type[Scooter] | -0.7231 | 0.526 | -1.374 | 0.170 |
| Preparation_Time_min | 7.2708 | 0.228 | 31.929 | 0.000 |
| Courier_Experience_yrs | -1.5712 | 0.228 | -6.886 | 0.000 |
Ridge regression¶
We begin by splitting the database into train and test sets.
Next, we search for the optimal lambda for ridge regression.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
lambda_values = np.logspace(-2, 4, 300)
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': lambda_values}
ridge_cv = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=10)  # 10-fold grid search over lambda
ridge_cv.fit(X_train, y_train)
best_alpha = ridge_cv.best_params_['ridge__alpha']
print(f"Best Ridge lambda: {best_alpha}")
Best Ridge lambda: 1.7680381784572086
We evaluate the test set using the lambda found.
ridge_best = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha))
ridge_best.fit(X_train, y_train)
ridge_pred = ridge_best.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)
coefficients = ridge_best.named_steps['ridge'].coef_
coeff_df = pd.DataFrame({'Variable': X_train.columns, 'Coefficient': coefficients})
print(coeff_df)
                  Variable  Coefficient
0                intercept     0.000000
1              Distance_km    16.921252
2           Weather[Foggy]     2.048377
3           Weather[Rainy]     1.621683
4           Weather[Snowy]     3.064244
5           Weather[Windy]     1.014522
6       Traffic_Level[Low]    -5.150850
7    Traffic_Level[Medium]    -2.279150
8     Time_of_Day[Evening]     0.109579
9     Time_of_Day[Morning]    -0.010385
10      Time_of_Day[Night]     0.175841
11       Vehicle_Type[Car]    -0.082883
12   Vehicle_Type[Scooter]    -0.197417
13    Preparation_Time_min     7.407239
14  Courier_Experience_yrs    -1.444747
From the results, we see that the variables found to be less significant in the previous models have ridge coefficients very close to zero.
This occurs because ridge regression shrinks the coefficients of less relevant variables toward zero, improving model generalization.
We also notice that the intercept coefficient is exactly zero.
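The zero intercept coefficient is an artifact of the pipeline rather than a finding: ModelSpec adds a constant column of ones, StandardScaler centres that zero-variance column to all zeros (scikit-learn keeps the scale at 1 when the variance is zero), and an all-zero feature receives a coefficient of exactly zero under the ridge penalty, while Ridge fits its own intercept separately (fit_intercept=True by default). A minimal check:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

const_col = np.ones((5, 1))            # the intercept column produced by ModelSpec
scaled = StandardScaler().fit_transform(const_col)
print(scaled.ravel())                  # centred zero-variance column: all zeros
```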
MSE = ridge_mse
RMSE = np.sqrt(MSE)
print("MSE", MSE)
print("RMSE", RMSE)
MSE 41.90806681166175
RMSE 6.473644013356137
The results are similar to those obtained previously with the three cross-validation methods (validation set, LOOCV, and k-fold), with an RMSE a few percent lower.
coefs = []
feature_names = X_train.columns
for a in lambda_values:
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=a))
    pipe.fit(X_train, y_train)
    coefs.append(pipe.named_steps['ridge'].coef_)
coefs = np.array(coefs)
plt.figure(figsize=(14, 8))
ax = plt.gca()
colors = plt.cm.tab20(np.linspace(0, 1, len(feature_names)))
for idx, name in enumerate(feature_names):
ax.plot(lambda_values, coefs[:, idx], label=name, color=colors[idx])
ax.set_xscale('log')
plt.xlabel('Lambda')
plt.axvline(best_alpha, color='r', linestyle='--')
plt.ylabel('Coefficients')
plt.title('Ridge Coefficients as Function of Regularization')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small')
plt.tight_layout()
plt.ylim(-10, 20)
plt.xlim(1e-1, 1e4)
plt.show()
The results of the coefficients can also be seen in this graph, at the point corresponding to the red dashed line, which indicates the optimal lambda.
It is noticeable that the important variables are those farthest from zero.
It is also noticeable that ridge regression drives the coefficients toward zero as lambda increases.
mean_test_scores = -ridge_cv.cv_results_['mean_test_score']
alphas = ridge_cv.param_grid['ridge__alpha']
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, mean_test_scores)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.xlabel('Lambda')
plt.ylabel('Mean Squared Error')
plt.title('Ridge Cross-Validation Error')
plt.ylim(48.63, 48.65)
plt.xlim(1e-1, 1e2)
plt.show()
In this other plot, we can see that the minimum MSE is found at the optimal lambda.
Lasso regression¶
We will now proceed in the same manner with lasso regression.
lasso_pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
lasso_cv = GridSearchCV(lasso_pipe,
{'lasso__alpha': lambda_values},
scoring='neg_mean_squared_error',
cv=10)
lasso_cv.fit(X_train, y_train)
best_alpha = lasso_cv.best_params_['lasso__alpha']
print(f"Best Lasso lambda: {best_alpha}")
Best Lasso lambda: 0.10553861030365236
lasso_best = lasso_cv.best_estimator_
lasso_pred = lasso_best.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)
print(f"Number of initial coefficients: {len(feature_names)}")
final_coefs = lasso_best.named_steps['lasso'].coef_
print(f"Number of non-zero coefficients: {np.sum(final_coefs != 0)}")
coeff_df = pd.DataFrame({'Variable': feature_names, 'Coefficient': final_coefs})
print(coeff_df)
Number of initial coefficients: 15
Number of non-zero coefficients: 11
Variable Coefficient
0 intercept 0.000000
1 Distance_km 16.894242
2 Weather[Foggy] 1.879118
3 Weather[Rainy] 1.423975
4 Weather[Snowy] 2.893828
5 Weather[Windy] 0.823138
6 Traffic_Level[Low] -4.905907
7 Traffic_Level[Medium] -2.025254
8 Time_of_Day[Evening] 0.000000
9 Time_of_Day[Morning] -0.000000
10 Time_of_Day[Night] 0.048237
11 Vehicle_Type[Car] -0.000000
12 Vehicle_Type[Scooter] -0.084120
13 Preparation_Time_min 7.326051
14 Courier_Experience_yrs -1.307976
With lasso regression, the less significant variables are driven exactly to zero, allowing for model simplification.
Similarly, the intercept remains zero in this case as well.
MSE = lasso_mse
RMSE = np.sqrt(MSE)
print("MSE", MSE)
print("RMSE", RMSE)
MSE 42.21212509283608
RMSE 6.497085892370215
The results compared to ridge regression are practically identical.
coefs = []
for a in lambda_values:
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=a, max_iter=10000))
    pipe.fit(X_train, y_train)
    coefs.append(pipe.named_steps['lasso'].coef_)
coefs = np.array(coefs)
plt.figure(figsize=(14, 8))
ax = plt.gca()
colors = plt.cm.tab20(np.linspace(0, 1, len(feature_names)))
for idx, name in enumerate(feature_names):
ax.plot(lambda_values, coefs[:, idx], label=name, color=colors[idx])
ax.set_xscale('log')
plt.xlabel('Lambda')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as Function of Regularization')
plt.axvline(best_alpha, color='r', linestyle='--')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small')
plt.xlim(1e-2, 1e2)
plt.tight_layout()
plt.show()
Unlike the corresponding plot for ridge regression, in lasso regression the coefficients reach exactly zero as lambda increases.
mean_test_scores = -lasso_cv.cv_results_['mean_test_score']
alphas = lasso_cv.param_grid['lasso__alpha']
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, mean_test_scores)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.xlabel('Lambda')
plt.ylabel('Mean Squared Error')
plt.title('Lasso Cross-Validation Error')
plt.ylim(48, 50)
plt.xlim(1e-2, 1e1)
plt.show()
As with the optimal lambda for ridge regression, the optimal lambda for lasso regression is also at the minimum of the MSE function.
Regression Trees¶
Regression trees are decision tree models that split the data into regions based on the input variables; within each region the prediction is the mean of the response, and the splits are chosen to minimize the variability inside each region.
We return to a non-normalized model: tree-based methods are insensitive to the scale of the variables, so standardization is not needed.
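The region-mean behaviour described above can be sketched on a toy 1-D example (hypothetical data): a depth-1 tree finds the split that minimizes within-region squared error and predicts each region's mean.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D data with two obvious clusters.
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

stump = DecisionTreeRegressor(max_depth=1).fit(X_demo, y_demo)
# Predictions are the region means: 6.0 on the left, 21.0 on the right.
print(stump.predict([[2.0], [11.0]]))
```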
As with previous methods, we split the data into train and test sets and use a DecisionTreeRegressor.
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Time_of_Day',
'Vehicle_Type',
'Preparation_Time_min',
'Courier_Experience_yrs'
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=1
)
tree_model = DecisionTreeRegressor(random_state=2)
tree_model.fit(X_train, y_train)
DecisionTreeRegressor(random_state=2)
feature_importances = tree_model.feature_importances_
feature_names = tree_model.feature_names_in_
features_with_importances = list(zip(feature_names, feature_importances))
print("Regression Tree:")
print(f"Tree depth: {tree_model.get_depth()}")
print(f"Number of leaves: {tree_model.get_n_leaves()}")
print("Feature importances with names:")
for feature, importance in features_with_importances:
    print(f"{feature}: {importance}")
Regression Tree:
Tree depth: 15
Number of leaves: 391
Feature importances with names:
intercept: 0.0
Distance_km: 0.7395694133585553
Weather[Foggy]: 0.007631080736828054
Weather[Rainy]: 0.00344948034208744
Weather[Snowy]: 0.012462360581793706
Weather[Windy]: 0.005930637070555485
Traffic_Level[Low]: 0.024642504880278127
Traffic_Level[Medium]: 0.002713006906142687
Time_of_Day[Evening]: 0.005121012228573312
Time_of_Day[Morning]: 0.0031360716319012547
Time_of_Day[Night]: 0.00165402628289708
Vehicle_Type[Car]: 0.001991788688687997
Vehicle_Type[Scooter]: 0.005642427790877079
Preparation_Time_min: 0.163347385261935
Courier_Experience_yrs: 0.022708804238887673
The resulting tree is very complex, with 15 levels and 391 leaves, suggesting strong overfitting.
Regarding variable importance information, the results are consistent with those observed in the previous models.
plt.figure(figsize=(50, 50))
plot_tree(tree_model,
feature_names=X.columns,
filled=True,
rounded=True,
proportion=True,
fontsize=12,
precision=2)
plt.title('Regression Tree', fontsize=16)
plt.show()
The plot confirms the enormous complexity of the tree.
y_pred = tree_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.4f}")
print(f"R2: {r2:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {np.sqrt(mse):.4f}")
MAE: 8.5419
R2: 0.6977
MSE: 124.6488
RMSE: 11.1646
The test-set metrics are worse than those of the previous models, although this was expected given the enormous size of the tree.
We proceed to simplify the tree by studying the maximum depth.
max_depth_values = range(1, 100)
cv_scores = []
for depth in max_depth_values:
    tree_cv = DecisionTreeRegressor(criterion='squared_error', max_depth=depth, random_state=2)
    scores = cross_val_score(tree_cv, X_train, y_train, cv=10, scoring='r2')
    cv_scores.append(np.mean(scores))
plt.figure(figsize=(10, 6))
plt.plot(max_depth_values, cv_scores, 'o-')
plt.xlabel('Max Depth')
plt.ylabel('Cross-Validation R²')
plt.title('Cross-Validation Results for Tree Pruning')
plt.grid(True)
plt.xlim(0, 100)
plt.show()
The graph shows the effect of the decision tree depth on performance, using cross-validation and the R² score.
The results show an initial increase in R² followed by a decrease, indicating the optimization between underfitting and overfitting.
The optimal depth seems to be 4, as this is where R² is maximum.
best_depth = max_depth_values[np.argmax(cv_scores)]
print(f"Best depth from cross-validation: {best_depth}")
Best depth from cross-validation: 4
The optimal depth of 4 is confirmed, so we proceed with pruning.
pruned_tree = DecisionTreeRegressor(max_depth=best_depth, random_state=2)
pruned_tree.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=4, random_state=2)
plt.figure(figsize=(50, 50))
plot_tree(pruned_tree,
feature_names=X.columns,
filled=True,
rounded=True,
proportion=True,
fontsize=13,
precision=2)
plt.title(f'Pruned Regression Tree (Max Depth = {best_depth})')
plt.show()
The pruned tree is much simpler and easier to explain.
The only variables considered in the end are Distance_km and Preparation_Time_min.
Darker orange shades indicate more extreme predicted values (higher or lower), while lighter shades indicate values close to the mean of the target.
These shades represent a continuous gradient of predicted values, not categories.
y_pred_pruned = pruned_tree.predict(X_test)
mae = mean_absolute_error(y_test, y_pred_pruned)
mse = mean_squared_error(y_test, y_pred_pruned)
r2 = r2_score(y_test, y_pred_pruned)
print(f"MAE: {mae:.4f}")
print(f"R2: {r2:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {np.sqrt(mse):.4f}")
MAE: 7.5838
R2: 0.7749
MSE: 92.8263
RMSE: 9.6346
With pruning, the results have also improved.
Bagging¶
We now try using ensemble methods for trees, starting with Bagging.
Bagging builds multiple trees on different random subsets of the data and averages their predictions.
This technique reduces variance and helps to improve model stability and accuracy.
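Before using the library implementation, the idea can be sketched by hand on hypothetical synthetic data: fit one fully grown tree per bootstrap resample and average the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy 1-D regression problem.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.3, size=200)

preds = []
for _ in range(25):
    idx = rng.integers(0, len(X_demo), size=len(X_demo))  # bootstrap resample (with replacement)
    tree = DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx])
    preds.append(tree.predict(X_demo))

# Averaging across trees smooths out the high variance of any single tree.
bagged = np.mean(preds, axis=0)
print(bagged.shape)
```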
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Time_of_Day',
'Vehicle_Type',
'Preparation_Time_min',
'Courier_Experience_yrs'
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=1
)
For Bagging, we use a Random Forest model with max features set equal to the total number of features.
This ensures that each tree is built using all available variables, maintaining the standard Bagging approach where randomness comes only from the data sampling, not from feature selection.
bagg_model = RandomForestRegressor(
n_estimators=500,
max_features=X.shape[1],
random_state=1,
bootstrap=True
)
bagg_model.fit(X_train, y_train)
RandomForestRegressor(max_features=15, n_estimators=500, random_state=1)
error_rate = []
X_test_array = X_test.to_numpy()
for i in range(1, 501, 10):
    y_pred_bagg = np.mean([t.predict(X_test_array) for t in bagg_model.estimators_[:i]], axis=0)
    error_rate.append(mean_squared_error(y_test, y_pred_bagg))
plt.figure(figsize=(10, 6))
plt.plot(range(1, 501, 10), error_rate, 'r-', lw=2, label='Bagging')
plt.xlabel('Number of Trees')
plt.ylabel('Test MSE')
plt.title('Test MSE vs Number of Trees (Bagging)')
plt.legend()
plt.grid(True)
plt.show()
In this graph, we see that as the number of trees increases, the MSE decreases.
In particular, it stabilizes from around 100 trees onward.
y_pred_bagg = bagg_model.predict(X_test)
mse_bagg = mean_squared_error(y_test, y_pred_bagg)
print(f"Bagging MSE: {mse_bagg:.4f}")
print(f"Bagging RMSE: {np.sqrt(mse_bagg):.4f}")
Bagging MSE: 65.7354
Bagging RMSE: 8.1077
The error metrics show better results than the previous single-tree models.
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_bagg)
plt.plot([0, 100], [0, 100], 'k--')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Bagging: Actual vs Predicted')
plt.grid(True)
plt.show()
If the points lie close to the diagonal, the model is predicting well.
The model estimated with Bagging is reasonably accurate: it performs better than a single tree, though still worse than the linear models.
A few outliers are present.
importance_bagg = pd.Series(bagg_model.feature_importances_, index=X.columns)
importance_bagg.sort_values(ascending=False, inplace=True)
plt.figure(figsize=(10, 6))
importance_bagg.plot(kind='bar')
plt.title('Feature Importance (Bagging)')
plt.tight_layout()
plt.show()
In this other plot, we see that the variable Distance_km dominates, followed by Preparation_Time_min, although with less than half the importance of the first.
These results were expected considering the node values of the pruned tree built earlier.
Random forest¶
Using Random Forest, which is an extension of Bagging, we introduce an additional layer of randomness by selecting a random subset of features at each split, rather than considering all features.
This helps to reduce correlation between trees, improving the model's generalization and reducing overfitting compared to Bagging, where all features are used in each tree split.
forest_model = RandomForestRegressor(
n_estimators=100,
max_features='sqrt',
random_state=1
)
forest_model.fit(X_train, y_train)
RandomForestRegressor(max_features='sqrt', random_state=1)
error_rate_rf = []
X_test_array = X_test.to_numpy()
n_trees_rf = len(forest_model.estimators_)  # the forest has 100 trees, not 500
for i in range(1, n_trees_rf + 1, 10):
    y_pred_rf = np.mean([t.predict(X_test_array) for t in forest_model.estimators_[:i]], axis=0)
    error_rate_rf.append(mean_squared_error(y_test, y_pred_rf))
plt.figure(figsize=(10, 6))
plt.plot(range(1, n_trees_rf + 1, 10), error_rate_rf, 'g-', label='Random Forest')
plt.plot(range(1, 501, 10), error_rate, 'r-', lw=2, label='Bagging')
plt.xlabel('Number of Trees')
plt.ylabel('Test MSE')
plt.title('Test MSE vs Number of Trees')
plt.legend()
plt.grid(True)
plt.show()
In our case, as seen from the plot, Random Forest performs worse than Bagging.
This could be due to the additional feature randomness in Random Forest, which may lead to less effective splits compared to Bagging, where all features are considered at each split.
y_pred_rf = forest_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest MSE: {mse_rf:.4f}")
print(f"Random Forest RMSE: {np.sqrt(mse_rf):.4f}")
Random Forest MSE: 84.8856
Random Forest RMSE: 9.2133
The worse performance is confirmed by the test-set error metrics.
importance_rf = pd.Series(forest_model.feature_importances_, index=X.columns)
importance_rf.sort_values(ascending=False, inplace=True)
plt.figure(figsize=(10, 6))
importance_rf.plot(kind='bar')
plt.title('Feature Importance (Random Forest)')
plt.tight_layout()
plt.show()
With Random Forest, we force the other variables to play a more significant role by randomly selecting subsets of features for each tree split.
This randomness encourages the model to explore different feature combinations, ensuring that no single feature dominates the decision-making process, as opposed to Bagging, where all features are used in each split.
Boosting¶
As the last tree-based method, we use Boosting.
Boosting is an ensemble technique that builds multiple trees sequentially, where each tree corrects the errors of the previous one.
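With squared-error loss, the mechanism can be sketched manually on hypothetical data: each new shallow tree is fit to the residuals of the current ensemble and added with a small learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy 1-D regression problem.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.2, size=200)

pred = np.zeros_like(y_demo)
learning_rate = 0.1
for _ in range(100):
    residuals = y_demo - pred                          # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    pred += learning_rate * tree.predict(X_demo)       # small corrective step

print("training MSE:", np.mean((y_demo - pred) ** 2))  # shrinks as trees are added
```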
boost_model = GradientBoostingRegressor(
n_estimators=5000,
max_depth=4,
learning_rate=0.001,
loss='squared_error',
random_state=1
)
boost_model.fit(X_train, y_train)
GradientBoostingRegressor(learning_rate=0.001, max_depth=4, n_estimators=5000, random_state=1)
y_pred_boost = boost_model.predict(X_test)
mse_boost = mean_squared_error(y_test, y_pred_boost)
print(f"Boosting MSE: {mse_boost:.4f}")
print(f"Boosting RMSE: {np.sqrt(mse_boost):.4f}")
Boosting MSE: 60.2433
Boosting RMSE: 7.7617
methods = ['Bagging', 'Random Forest', 'Boosting']
mse_values = [mse_bagg, mse_rf, mse_boost]
Based on the results, it appears that Boosting is the most effective tree-based method.
However, it still underperforms compared to the initial linear regression models, which remain the most accurate for this dataset.
plt.figure(figsize=(10, 6))
plt.bar(methods, mse_values)
plt.ylabel('Test MSE')
plt.title('Comparison of Tree-Based Methods')
for i, v in enumerate(mse_values):
    plt.text(i, v + 0.1, f"{v:.4f}", ha='center')
plt.tight_layout()
plt.show()
The graph compares the performance of the three ensemble methods: Bagging, Random Forest, and Boosting. As shown, Boosting achieves the lowest test MSE.